Getting started
We have provided an .Rmd file for this activity: R/03-regression-activity.Rmd. We recommend that you avoid knitting often and rely more on the Run Current Chunk feature, as it will take some time to simulate the Bayesian models.
If you prefer to use an R file, first make sure that you’re working within your directory with the repo files, and then run the following from the Console:
knitr::purl("R/03-regression-activity.Rmd", "R/03-regression-activity.R")
This will create an R file with the activity R code. For the corresponding discussion and prompts, see the R/03-regression-activity.html file in the repo.
# Load packages
library(bayesrules)
library(tidyverse)
library(rstanarm)
library(bayesplot)
library(janitor)
Capital Bikeshare is a bike sharing service in the Washington, D.C. area. Using the bikes data, our analysis will focus on the relationship between number of bikeshare rides (\(Y\): rides), and temperature (\(X\): temp_feel). We will model:
\[\mu_i = \beta_0 + \beta_1X_i\]
\(\beta_0\) is the typical ridership on days in which the temperature was 0 degrees ( \(X_i\)=0). It is not interpretable in this case.
\(\beta_1\) is the typical change in ridership for every one unit increase in temperature.
likelihood: \(Y_i | \beta_0, \beta_1, \sigma \;\;\;\stackrel{ind}{\sim} N\left(\mu_i, \sigma^2\right)\text{ with } \mu_i = \beta_0 + \beta_1X_i\)
prior models:
\(\beta_0\sim N(m_0, s_0^2 )\)
\(\beta_1\sim N(m_1, s_1^2 )\)
\(\sigma \sim \text{Exp}(l)\)
Since we have done some hands-on activities before this related to setting priors, in this activity we will follow along with some uninformative priors together and focus mostly on the posterior.
bike_model <- stan_glm(rides ~ temp_feel, data = bikes,
family = gaussian,
prior_intercept = normal(5000, 1000),
prior = normal(100, 40),
prior_aux = exponential(0.0008),
chains = 4, iter = 5000*2, seed = 84735, refresh = FALSE)
Exercise: stan_glm()
Can you identify each argument of the stan_glm() function and what their purpose is?
bayesplot package provides many plotting functions that can help us quickly examine the MCMC draws.
mcmc_trace(bike_model, size = 0.1)
Exercise: traceplots
Do the traceplots provide evidence that chains are stable?
mcmc_dens(bike_model)
Exercise: density plots
Looking at the density plots of the posterior simulation what can you conclude about the intercept and the slope? Do you think we have enough evidence to believe that the relationship between ridership and temperature is a positive one?
Our goal in this part is to simulate 20,000 predictions of ridership on a 75 degree day. First, let’s get our model results into a data frame.
# Store the 4 chains for each parameter in 1 data frame
bike_model_df <- as.data.frame(bike_model)
# Check it out
nrow(bike_model_df)
## [1] 20000
head(bike_model_df, 3)
## (Intercept) temp_feel sigma
## 1 -2656.894 88.16061 1323.364
## 2 -2188.090 83.01252 1322.962
## 3 -1983.816 81.53971 1363.346
We have 20000 rows one for each iteration and 3 columns one for each parameter. Let’s focus on the first row of this data
first_set <- head(bike_model_df, 1)
first_set
## (Intercept) temp_feel sigma
## 1 -2656.894 88.16061 1323.364
Under this particular scenario, \(\left(\beta_0^{(1)}, \beta_1^{(1)}, \sigma^{(1)}\right) = (-2657, 88.16, 1323)\), the average ridership at a given temperature is defined by
\[\mu = \beta_0^{(1)} + \beta_1^{(1)} X = -2657 + 88.16X \; .\]
As such, we’d expect an average of \(\mu = 3955\) riders on a 75 degree day:
mu <- first_set$`(Intercept)` + first_set$temp_feel * 75
mu
## [1] 3955.152
To capture the sampling variability around this average, i.e. the fact that not all 75 degree days have the same ridership, we can simulate our first official prediction \(Y_{\text{new}}^{(1)}\) by taking a random draw from the Normal model specified by this first parameter set:
\[Y_{\text{new}}^{(1)} | \beta_0, \beta_1, \sigma \; \sim \; N\left(3955, 1323^2\right) \; .\]
Taking a draw from this model using rnorm(), we happen to observe an above average 4838 rides on the 75 degree day:
set.seed(84735)
y_new <- rnorm(1, mean = mu, sd = first_set$sigma)
y_new
## [1] 4838.144
Now let’s do this 19,999 more times. That is, let’s follow the same two-step process to simulate a prediction of ridership from each of the 20,000 sets of regression parameters \(i\) in bike_model_df: (1) calculate the average ridership on 75 degree days, \(\mu^{(i)} = \beta_0^{(i)} + \beta_1^{(i)}\cdot 75\); then (2) sample from the Normal model centered at this average with standard deviation \(\sigma^{(i)}\):
# Predict rides for each parameter set in the chain
set.seed(84735)
predict_75 <- bike_model_df %>%
mutate(mu = `(Intercept)` + temp_feel*75,
y_new = rnorm(20000, mean = mu, sd = sigma))
head(predict_75, 3)
## (Intercept) temp_feel sigma mu y_new
## 1 -2656.894 88.16061 1323.364 3955.152 4838.144
## 2 -2188.090 83.01252 1322.962 4037.849 3874.013
## 3 -1983.816 81.53971 1363.346 4131.662 5196.255
# Plot the posterior model of the typical ridership on 75 degree days
ggplot(predict_75, aes(x = mu)) +
geom_density()
# Plot the posterior predictive model of tomorrow's ridership
ggplot(predict_75, aes(x = y_new)) +
geom_density()
Exercise: evaluating predictions
Now that we have predictions for ridership for a day, how would you assess if our prediction is a good one or not? Remember that, all models are wrong1. It might make more sense to discuss how you would assess how bad the prediction is.
We can also use built-in functions to make posterior predictions. This will make life a little easier because we can easily make predictions for days that have temperature other than 75.
predictions <- posterior_predict(bike_model, newdata = bikes)
If you take a look at the predictions object you will see that for each of the 500 observed temp_feel values there are 20,000 posterior predictions of rides. We can compute 50% and 95% posterior predictive credible intervals for each \(x\) (temp_feel) with the ppc_intervals() function. To start understanding the plot, you may first focus on when \(x = 75\) since this is the case, we have computed the posterior prediction from scratch.
Exercise: model evaluation
Based on the plot below, how bad do you think the model is?
ppc_intervals(bikes$rides, yrep = predictions, x = bikes$temp_feel,
prob = 0.5, prob_outer = 0.95)
You might be wondering how many of these observed values are falling in their respective 50% and 95% credible intervals. Well, prediction_summary from bayesrules has an answer for this.
prediction_summary(model = bike_model, data = bikes,
prob_inner = 0.5, prob_outer = 0.95)
## mae mae_scaled within_50 within_95
## 1 993.7752 0.7713142 0.436 0.968
You might also be wondering what median absolute error mae is. We expect to be short on time for this part of the workshop so make sure to follow up on Model Evaluation in the book.
Please take a few minutes to plan what is ahead for you in your teaching and discuss in your breakout room.
George Box↩︎